Assignment 11 uses the Maftools package Mayakonda et al. (2018) in R to demonstrate the data from the TCGA database, including ESCA, AML Stoll and Akbani (2013), and BRCA cancer projects. In the following sections, I will explain the code chunk individually and how the data hook up with the functions.
This Section has two sub-sections. The first one loads the Maftools package, which already has the TCGA data. We can extract the maf file in maf.gz and the annotation from .tsv file separately to create a MAF file object. By this, we can further handle through default functions such as getSampleSummary and getGeneSummary to quickly access specific parts of data for an overview.
## -Reading
## -Validating
## -Silent variants: 475
## -Summarizing
## -Processing clinical data
## -Finished in 0.389s elapsed (0.364s cpu)
Second section illustrates the functions that skim the MAF object, which has multiple datasets such as genes and sample data; the clinical data are TCGA samples annotating related cancer stages and survival information. The Frame Shift are nucleotides that change the reading frame, resulting in a non-functional protein; in contrast, the In Frame insertion and deletion do not alter the reading frame but can only affect the protein sequence. We can apply the “write.mafSummary” function to export an object. In the TCGA database use case, we can use GDCdownload and GDCprepare to set up our study database after selecting the necessary data for analysis.
## An object of class MAF
## ID summary Mean Median
## 1: NCBI_Build 37 NA NA
## 2: Center genome.wustl.edu NA NA
## 3: Samples 193 NA NA
## 4: nGenes 1241 NA NA
## 5: Frame_Shift_Del 52 0.269 0
## 6: Frame_Shift_Ins 91 0.472 0
## 7: In_Frame_Del 10 0.052 0
## 8: In_Frame_Ins 42 0.218 0
## 9: Missense_Mutation 1342 6.953 7
## 10: Nonsense_Mutation 103 0.534 0
## 11: Splice_Site 92 0.477 0
## 12: total 1732 8.974 9
## Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
## 1: TCGA-AB-3009 0 5 0
## 2: TCGA-AB-2807 1 0 1
## 3: TCGA-AB-2959 0 0 0
## 4: TCGA-AB-3002 0 0 0
## 5: TCGA-AB-2849 0 1 0
## 6: TCGA-AB-2923 1 1 0
## In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
## 1: 1 25 2 1 34
## 2: 0 16 3 4 25
## 3: 0 22 0 1 23
## 4: 0 15 1 5 21
## 5: 0 16 1 2 20
## 6: 0 15 3 0 20
## Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
## 1: FLT3 0 0 1 33
## 2: DNMT3A 4 0 0 0
## 3: NPM1 0 33 0 0
## 4: IDH2 0 0 0 0
## 5: IDH1 0 0 0 0
## 6: TET2 10 4 0 0
## Missense_Mutation Nonsense_Mutation Splice_Site total MutatedSamples
## 1: 15 0 3 52 52
## 2: 39 5 6 54 48
## 3: 1 0 0 34 33
## 4: 20 0 0 20 20
## 5: 18 0 0 18 18
## 6: 4 8 1 27 17
## AlteredSamples
## 1: 52
## 2: 48
## 3: 33
## 4: 20
## 5: 18
## 6: 17
## Tumor_Sample_Barcode FAB_classification days_to_last_followup
## 1: TCGA-AB-2802 M4 365
## 2: TCGA-AB-2803 M3 792
## 3: TCGA-AB-2804 M3 2557
## 4: TCGA-AB-2805 M0 577
## 5: TCGA-AB-2806 M1 945
## 6: TCGA-AB-2807 M1 181
## Overall_Survival_Status
## 1: 1
## 2: 1
## 3: 0
## 4: 1
## 5: 1
## 6: 1
## [1] "Hugo_Symbol" "Entrez_Gene_Id" "Center"
## [4] "NCBI_Build" "Chromosome" "Start_Position"
## [7] "End_Position" "Strand" "Variant_Classification"
## [10] "Variant_Type" "Reference_Allele" "Tumor_Seq_Allele1"
## [13] "Tumor_Seq_Allele2" "Tumor_Sample_Barcode" "Protein_Change"
## [16] "i_TumorVAF_WU" "i_transcript_name"
The maftools provides multiple visualization function that fits the cancer genomic datasets TCGA nicely. Here we make six subsections to discuss each plot function and its parameters individually.
The first “plotmafSummary” takes MAF obj input and generates a dashboard that provides an overview of various quality control metrics and mutation signatures. The bol value of rmOutlier, dashboard, and titvRaw shows if you want to remove the outlier from data, show the dashboard summary or calculate the raw Ti/Tv ratio (transitions (Ti) to transversions (Tv) in a DNA sequence) to evaluate the quality of variant calling. Finally, if applicable, the addStat is an option to add statistical references, such as the median on the plot.
The “oncoplots” select the top 10 genes with the most mutation samples and illustrate its tumor mutational burden on the top and annotation of explanation mutation type, for example, In Frame/Frame Shift/Nonsense on the location, and shows where to be more likely to respond to immune checkpoint inhibitors.
The “plotTiTv” first uses the titv function to compute the transition-to-transversion (Ti/Tv) ratio for somatic variants in a given MAF file, a measure of the types of substitutions that occur in a dataset of DNA mutations. Then, the plotTiTv plots the result accordingly.
The Lollipop plots investigate a particular gene on its mutational location and frequency. Here we focus on the DNMT3A gene, showing its mutation rate. The “AACol” parameter can specify the mutation(s) location in the MAF file by Protein Change column. The “labelPos=882” is the y-axis position of the mutation label in the lollipop plot.
The “plotProtein” function inputs a gene name and reference sequence ID. The reference sequence ID retrieves the protein sequence from a database. It generates a plot that shows the location of exons, domains, and post-translational modifications on the protein sequence.
## 3 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
## HGNC refseq.ID protein.ID aa.length
## 1: DNMT3A NM_022552 NP_072046 912
## 2: DNMT3A NM_153759 NP_715640 723
## 3: DNMT3A NM_175629 NP_783328 912
## Using longer transcript NM_022552 for now.
## Removed 3 mutations for which AA position was not available
The rainfall plot shows the density of mutations across the genome. The x-axis is chromosomes, and the y-axis represents the density of mutations at that position. The “pointSize = 0.4” parameter sets the size of the points on the plot. The “detectChangePoints = TRUE” parameter in this example enables the detection of significant change points and highlights those regions with black arrows.
## Processing TCGA-A8-A08B..
The tcgaCompare function inputs the laml MAF file object and specifies the cohort name “Example-LAML” to compare with others in the TCGA database. Thus, the x-axis is all the cancer cohort, the y-axis is the tumor mutational burden (TBM) per MB, and the y-axis The capture size =50 argument sets the threshold that includes only the capture size greater than 50 for the analysis. The “logscale = TRUE” argument specifies the TBM in a logarithmic scale.
In the “plotvaf” function, create a plot of the variant allele frequency (VAF) distribution for the LAML dataset. The vafCol = ‘i_TumorVAF_WU’ argument specifies which column in the MAF file to use for the VAF data. The x-axis shows the VAF values on genes, and the y-axis illustrates the density of mutations on a scale of 0 to 1. As we can see, the DNMT3A with a high VAF value may suggest clonal mutations.
## Warning in FUN(X[[i]], ...): Removed 1 samples with zero mutations.
## Capture size [TCGA]: 35.8
## Capture size [Input]: 50
## Performing pairwise t-test for differences in mutation burden (per MB)..
In the data analysis section, we will further explore the analysis funciton in the following subsections.
First, the somaticInteractions generate a network plot of significant co-occurring and mutually exclusive gene mutations in a given MAF file. The green means highly correlated, and the asteroid sign indicates its p-value as the confidential interval of correlation assumption. On the other hand, The dark brown color represents the two independent gene mutations.
## gene1 gene2 pValue oddsRatio 00 11 01 10 pAdj
## 1: ASXL1 RUNX1 0.0001541586 55.215541 176 4 12 1 0.003568486
## 2: IDH2 RUNX1 0.0002809928 9.590877 164 7 9 13 0.006055880
## 3: IDH2 ASXL1 0.0004030636 41.077327 172 4 1 16 0.008126283
## 4: FLT3 NPM1 0.0009929836 3.763161 125 17 16 35 0.018664260
## 5: SMC3 DNMT3A 0.0010451985 20.177713 144 6 42 1 0.018664260
## ---
## 296: PLCE1 ASXL1 1.0000000000 0.000000 184 0 5 4 1.000000000
## 297: RAD21 FAM5C 1.0000000000 0.000000 183 0 5 5 1.000000000
## 298: PLCE1 FAM5C 1.0000000000 0.000000 184 0 5 4 1.000000000
## 299: PLCE1 RAD21 1.0000000000 0.000000 184 0 5 4 1.000000000
## 300: EZH2 PLCE1 1.0000000000 0.000000 186 0 4 3 1.000000000
## Event pair event_ratio
## 1: Co_Occurence ASXL1, RUNX1 4/13
## 2: Co_Occurence IDH2, RUNX1 7/22
## 3: Co_Occurence ASXL1, IDH2 4/17
## 4: Co_Occurence FLT3, NPM1 17/51
## 5: Co_Occurence DNMT3A, SMC3 6/43
## ---
## 296: Mutually_Exclusive ASXL1, PLCE1 0/9
## 297: Mutually_Exclusive FAM5C, RAD21 0/10
## 298: Mutually_Exclusive FAM5C, PLCE1 0/9
## 299: Mutually_Exclusive PLCE1, RAD21 0/9
## 300: Mutually_Exclusive EZH2, PLCE1 0/7
To detect the driver genes of cancer, we can use the “oncodrive” function that takes MAF file as input, and the AACol parameter specifies the mutation(s) location/gene in the MAF file by Protein Change. The “minMut” is the threshold to pick up the gene above specific mutation numbers to be considered a potential driver, and pvalMehtod is for calculating the statistical significance of the observed mutations in each gene.
The “plotOncodrive” function generates a volcano plot of the “oncodrive” results. The “fdrCutOff” sets the FDR method’s threshold for the adjusted p-value to avoid a false discovery rate (FDR) when determining significance. The “useFraction” decides whether to use the fraction of mutated samples in each cluster as the x-axis or the number of mutations. The “labelSize” controls the minimum size of the labels for each gene in the plot.
##
|
| | 0%
|
|=== | 4%
|
|====== | 9%
|
|========= | 13%
|
|============ | 17%
|
|=============== | 22%
|
|================== | 26%
|
|===================== | 30%
|
|======================== | 35%
|
|=========================== | 39%
|
|============================== | 43%
|
|================================= | 48%
|
|===================================== | 52%
|
|======================================== | 57%
|
|=========================================== | 61%
|
|============================================== | 65%
|
|================================================= | 70%
|
|==================================================== | 74%
|
|======================================================= | 78%
|
|========================================================== | 83%
|
|============================================================= | 87%
|
|================================================================ | 91%
|
|=================================================================== | 96%
|
|======================================================================| 100%
## Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
## 1: IDH1 0 0 0 0
## 2: IDH2 0 0 0 0
## 3: NPM1 0 33 0 0
## 4: NRAS 0 0 0 0
## 5: U2AF1 0 0 0 0
## 6: KIT 1 1 0 1
## Missense_Mutation Nonsense_Mutation Splice_Site total MutatedSamples
## 1: 18 0 0 18 18
## 2: 20 0 0 20 20
## 3: 1 0 0 34 33
## 4: 15 0 0 15 15
## 5: 8 0 0 8 8
## 6: 7 0 0 10 8
## AlteredSamples clusters muts_in_clusters clusterScores protLen zscore
## 1: 18 1 18 1.0000000 414 5.546154
## 2: 20 2 20 1.0000000 452 5.546154
## 3: 33 2 32 0.9411765 294 5.093665
## 4: 15 2 15 0.9218951 189 4.945347
## 5: 8 1 7 0.8750000 240 4.584615
## 6: 8 2 9 0.8500000 976 4.392308
## pval fdr fract_muts_in_clusters
## 1: 1.460110e-08 1.022077e-07 1.0000000
## 2: 1.460110e-08 1.022077e-07 1.0000000
## 3: 1.756034e-07 8.194826e-07 0.9411765
## 4: 3.800413e-07 1.330144e-06 1.0000000
## 5: 2.274114e-06 6.367520e-06 0.8750000
## 6: 5.607691e-06 1.308461e-05 0.9000000
The pfamDomains function creates a pfam domain object that identifies the Pfam domains affected by mutations in the input MAF file. The AACol parameter specifies the column containing the protein change information, and the top parameter limits the number of top-ranked Pfam domains to display. The resulting object contains two data frames: proteinSummary, which summarizes the mutations per protein, and domainSummary, which summarizes the mutations per Pfam domain.
## HGNC AAPos Variant_Classification N total fraction DomainLabel
## 1: DNMT3A 882 Missense_Mutation 27 54 0.5000000 AdoMet_MTases
## 2: IDH1 132 Missense_Mutation 18 18 1.0000000 PTZ00435
## 3: IDH2 140 Missense_Mutation 17 20 0.8500000 PTZ00435
## 4: FLT3 835 Missense_Mutation 14 52 0.2692308 PKc_like
## 5: FLT3 599 In_Frame_Ins 10 52 0.1923077 PKc_like
## 6: U2AF1 34 Missense_Mutation 7 8 0.8750000 zf-CCCH
## DomainLabel nMuts nGenes
## 1: PKc_like 55 5
## 2: PTZ00435 38 2
## 3: AdoMet_MTases 33 1
## 4: 7tm_1 24 24
## 5: COG5048 17 17
## 6: Cadherin_repeat 16 16
To perform survival analysis, we can use “mafSurvival” function. The function returns a data frame with survival information, such as the number of subjects at risk, the number of events, the survival probability, and the median survival time for each group. The function can also take multiple genes for grouping and perform survival analysis for each gene separately.
For predicting associated genes with survival, we can use the “survGroup” function takes a MAF file as input and identifies a set of genes associated with poor survival. The top parameter specifies the number of top mutated genes to consider, and the “geneSetSize” parameter sets the size of the gene set to predict.
The mafSurvGroup function is used to plot survival curves for multiple gene sets in one plot. It takes a MAF file object as input and the “geneSet” parameter specifies the gene set(s) to compare. The “time” parameter indicates the column in the MAF file containing the time variable, and the “Status” parameter indicates the column containing the status variable. The function returns a survival plot with a legend for each gene set. The x-axis represents the survival time, and the y-axis represents the probability of survival.
## DNMT3A
## 48
## Group medianTime N
## 1: Mutant 245 45
## 2: WT 396 137
## Gene_combination P_value hr WT Mutant
## 1: FLT3_DNMT3A 0.00104 2.510 164 18
## 2: DNMT3A_SMC3 0.04880 2.220 176 6
## 3: DNMT3A_NPM1 0.07190 1.720 166 16
## 4: DNMT3A_TET2 0.19600 1.780 176 6
## 5: FLT3_TET2 0.20700 1.860 177 5
## 6: NPM1_IDH1 0.21900 0.495 176 6
## 7: DNMT3A_IDH1 0.29300 1.500 173 9
## 8: IDH2_RUNX1 0.31800 1.580 176 6
## 9: FLT3_NPM1 0.53600 1.210 165 17
## 10: DNMT3A_IDH2 0.68000 0.747 178 4
## 11: DNMT3A_NRAS 0.99200 0.986 178 4
## Group medianTime N
## 1: Mutant 242.5 18
## 2: WT 379.5 164
In comparing two cohorts from different MAF files, we first read them as mentioned in the first section. Here use Primary APL and Relapse APL for the explanation. First, we use “mafCompare” function to compare two cohorts and save in “pt.vs.rt” data frame.
The “forestPlot” function generates a forest plot that displays each mutation’s effect size and confidence interval for comparing two groups. The mafCompareRes parameter takes the output of the “mafCompare” function, and pVal sets the threshold for significance.
Similar to the Oncoplot function , the “coOncoplot” function has a gene parameter that can assign an array of genes to compare among cohorts. As a result, it generates a co-occurring oncogene plot that compares the frequency of mutations in the given set of genes between two groups.
## -Reading
## -Validating
## --Non MAF specific values in Variant_Classification column:
## ITD
## -Silent variants: 45
## -Summarizing
## -Processing clinical data
## --Missing clinical data
## -Finished in 0.098s elapsed (0.092s cpu)
## -Reading
## -Validating
## --Non MAF specific values in Variant_Classification column:
## ITD
## -Silent variants: 19
## -Summarizing
## -Processing clinical data
## --Missing clinical data
## -Finished in 0.076s elapsed (0.073s cpu)
## $results
## Hugo_Symbol Primary Relapse pval or ci.up ci.low
## 1: PML 1 11 1.529935e-05 0.03537381 0.2552937 0.000806034
## 2: RARA 0 7 2.574810e-04 0.00000000 0.3006159 0.000000000
## 3: RUNX1 1 5 1.310500e-02 0.08740567 0.8076265 0.001813280
## 4: FLT3 26 4 1.812779e-02 3.56086275 14.7701728 1.149009169
## 5: ARID1B 5 8 2.758396e-02 0.26480490 0.9698686 0.064804160
## 6: WT1 20 14 2.229087e-01 0.60619329 1.4223101 0.263440988
## 7: KRAS 6 1 4.334067e-01 2.88486293 135.5393108 0.337679367
## 8: NRAS 15 4 4.353567e-01 1.85209500 8.0373994 0.553883512
## 9: ARID1A 7 4 7.457274e-01 0.80869223 3.9297309 0.195710173
## adjPval
## 1: 0.0001376942
## 2: 0.0011586643
## 3: 0.0393149868
## 4: 0.0407875250
## 5: 0.0496511201
## 6: 0.3343630535
## 7: 0.4897762916
## 8: 0.4897762916
## 9: 0.7457273717
##
## $SampleSummary
## Cohort SampleSize
## 1: Primary 124
## 2: Relapse 58
## HGNC refseq.ID protein.ID aa.length
## 1: PML NM_002675 NP_002666 633
## 2: PML NM_033238 NP_150241 882
## 3: PML NM_033239 NP_150242 829
## 4: PML NM_033240 NP_150243 611
## 5: PML NM_033244 NP_150247 560
## 6: PML NM_033246 NP_150249 423
## 7: PML NM_033247 NP_150250 435
## 8: PML NM_033249 NP_150252 585
## 9: PML NM_033250 NP_150253 781
## HGNC refseq.ID protein.ID aa.length
## 1: PML NM_002675 NP_002666 633
## 2: PML NM_033238 NP_150241 882
## 3: PML NM_033239 NP_150242 829
## 4: PML NM_033240 NP_150243 611
## 5: PML NM_033244 NP_150247 560
## 6: PML NM_033246 NP_150249 423
## 7: PML NM_033247 NP_150250 435
## 8: PML NM_033249 NP_150252 585
## 9: PML NM_033250 NP_150253 781
In the enrichment analysis, the clinicalEnrichment function specified clinical features equal to “FAB_classification,” which divides leukemia into M0 to M7 stages and is used as annotation for the “laml” MAF file, saved as “fab.ce” object.
In the “fab.ce” object, we can access the groupwise_comparision data by a dollar sign and filter out data p-value equal to or higher than 5%. Lastly, we can visualize the result of enrichment by the plotEnrichmentResults function.
##
## M0 M1 M2 M3 M4 M5 M6 M7
## 19 44 44 21 39 19 3 3
## Hugo_Symbol Group1 Group2 n_mutated_group1 n_mutated_group2 p_value
## 1: IDH1 M1 Rest 11 of 44 7 of 149 0.0002597371
## 2: TP53 M7 Rest 3 of 3 12 of 190 0.0003857187
## 3: DNMT3A M5 Rest 10 of 19 38 of 174 0.0089427384
## 4: CEBPA M2 Rest 7 of 44 6 of 149 0.0117352110
## 5: RUNX1 M0 Rest 5 of 19 11 of 174 0.0117436825
## 6: NPM1 M5 Rest 7 of 19 26 of 174 0.0248582372
## 7: NPM1 M3 Rest 0 of 21 33 of 172 0.0278630823
## 8: DNMT3A M3 Rest 1 of 21 47 of 172 0.0294005111
## OR OR_low OR_high fdr
## 1: 6.670592 2.173829026 21.9607250 0.0308575
## 2: Inf 5.355415451 Inf 0.0308575
## 3: 3.941207 1.333635173 11.8455979 0.3757978
## 4: 4.463237 1.204699322 17.1341278 0.3757978
## 5: 5.216902 1.243812880 19.4051505 0.3757978
## 6: 3.293201 1.001404899 10.1210509 0.5880102
## 7: 0.000000 0.000000000 0.8651972 0.5880102
## 8: 0.133827 0.003146708 0.8848897 0.5880102
The “drugInteractions” function is to identify potential drug-gene interactions based on mutations in genes that are known to be associated with drug response. It takes a MAF file as input and returns a data frame with potential drug-gene interactions, including the gene’s name, the type of mutation, the drugs associated with the gene, and the evidence supporting that interaction.
In the example, we search the interaction of the drug by specifying the gene “DNMT3A” and save into “dnmt3a.dgi” object. Then we can further print out the columns we are interested in, like Gene, interaction_types, and drug_name, as illustrated.
## Number of claimed drugs for given genes:
## Gene N
## 1: DNMT3A 7
## Gene interaction_types drug_name drug_claim_name
## 1: DNMT3A N/A
## 2: DNMT3A DAUNORUBICIN Daunorubicin
## 3: DNMT3A DECITABINE Decitabine
## 4: DNMT3A IDARUBICIN IDARUBICIN
## 5: DNMT3A DECITABINE DECITABINE
## 6: DNMT3A inhibitor DECITABINE CHEMBL1201129
## 7: DNMT3A inhibitor AZACITIDINE CHEMBL1489
The OncogenicPathways function generates a summary table of the oncogenic pathways identified in the MAF file and the number of samples each pathway affects. The “PlotOncogenicPathways” function can be used to plot the genes involved in a specific oncogenic pathway, along with their mutation status in the samples. These functions are useful for exploring the oncogenic pathways commonly altered in a particular cancer type and understanding the potential targets for therapy.
## Pathway N n_affected_genes fraction_affected Mutated_samples
## 1: PI3K 29 1 0.03448276 1
## 2: NRF2 3 1 0.33333333 1
## 3: TP53 6 2 0.33333333 15
## 4: WNT 68 3 0.04411765 4
## 5: MYC 13 3 0.23076923 3
## 6: NOTCH 71 6 0.08450704 8
## 7: Hippo 38 7 0.18421053 7
## 8: RTK-RAS 85 18 0.21176471 97
## Fraction_mutated_samples
## 1: 0.005181347
## 2: 0.005181347
## 3: 0.077720207
## 4: 0.020725389
## 5: 0.015544041
## 6: 0.041450777
## 7: 0.036269430
## 8: 0.502590674